Analysis of Quality for White Wine

This report explores a dataset of white wines containing qualities and attributes for about 4900 wines.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

Our dataset consists of 13 variables (including the row index X) and 4898 observations. First I created a bar chart for every variable to see the distribution of the data. For all variables except quality (since it is an ordered variable) I also created a scatterplot and a boxplot.
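The per-variable summaries shown throughout this section (minimum, quartiles, mean, maximum) can be reproduced with a few lines of code. The following is a minimal Python sketch, not the code used for this report; the function name and the sample values are hypothetical. It uses the "inclusive" quantile method, which to my knowledge corresponds to the default quantile definition in R.

```python
from statistics import mean, quantiles

def six_number_summary(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    # method="inclusive" interpolates between the observed data points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical sample values, not taken from the wine data.
sample = [3.0, 5.0, 6.0, 6.0, 7.0, 9.0]
summary = six_number_summary(sample)
print(summary)
```

Comparing the mean with the median in such a summary is a quick first hint at skew before looking at any plot.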

The quality scores seem to be roughly normally distributed in this dataset, centred on a score of 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The fixed acidity is approximately normally distributed with a minimum of 3.8 g/dm³ and a maximum of 14.2 g/dm³. The mean and the median are close to each other. The maximum seems to be an outlier, as do some other values between 10 and 12.
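One common way to make "outlier" concrete is the boxplot rule: a value counts as an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the fixed acidity quartiles from the summary above. The rule and the factor 1.5 are an assumption here; the report does not say which criterion was used.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the boxplot outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of fixed.acidity taken from the summary table above.
low, high = iqr_fences(6.3, 7.3)

def is_outlier(value):
    """True if the value lies outside the fences computed above."""
    return value < low or value > high

print(round(low, 2), round(high, 2))
print(is_outlier(14.2), is_outlier(6.8))
```

Under this rule the maximum of 14.2 and the values between 10 and 12 all fall above the upper fence.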

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The volatile acidity is positively skewed, with most values lower than 0.4 g/dm³. There are some outliers, but we will investigate them later.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The citric acid plot surprises me. The distribution looks close to normal with just a few outliers; however, there is a second peak at 0.49.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The distribution of residual sugar seems rather irregular, and the maximum of 65.8 is an extreme outlier.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

When it comes to chlorides, there is a strong skew to the right as well, with a lot of outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The variable free.sulfur.dioxide has a huge skew to the right as well. There is a big outlier at 289.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

total.sulfur.dioxide is almost normally distributed, if you ignore some of the outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density would look normally distributed if it weren’t for the outlier at the upper end, which makes the distribution positively skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The distribution of pH looks very close to normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution of sulphates has a slight positive skew. The outliers are not too far away from the mean and the median.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol follows a right skewed distribution.
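The skew statements above can be cross-checked against the summary tables without any plots: for a right-skewed variable the mean usually lies above the median. The sketch below computes (mean - median) / IQR from the values printed in the summaries. This ratio is only a rough indicator chosen for illustration, not a formal skewness statistic, and the variable and function names are my own.

```python
# (q1, median, mean, q3) copied from the summary tables above.
SUMMARIES = {
    "fixed.acidity": (6.3, 6.8, 6.855, 7.3),
    "volatile.acidity": (0.21, 0.26, 0.2782, 0.32),
    "citric.acid": (0.27, 0.32, 0.3342, 0.39),
    "residual.sugar": (1.7, 5.2, 6.391, 9.9),
    "chlorides": (0.036, 0.043, 0.04577, 0.05),
    "free.sulfur.dioxide": (23.0, 34.0, 35.31, 46.0),
    "total.sulfur.dioxide": (108.0, 134.0, 138.4, 167.0),
    "density": (0.9917, 0.9937, 0.9940, 0.9961),
    "pH": (3.09, 3.18, 3.188, 3.28),
    "sulphates": (0.41, 0.47, 0.4898, 0.55),
    "alcohol": (9.5, 10.4, 10.51, 11.4),
}

def skew_indicator(q1, median, mean, q3):
    """Mean-median gap scaled by the IQR; positive suggests right skew."""
    return (mean - median) / (q3 - q1)

indicators = {name: skew_indicator(*s) for name, s in SUMMARIES.items()}
for name, value in sorted(indicators.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {value:.3f}")
```

Every variable has a mean above its median, so all of them lean to the right at least slightly; the indicator is largest for chlorides and smallest for pH.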

What is the structure of your dataset?

There are 4898 types of wine in this dataset with 13 features. The only ordered factor variable is “quality”.

All the other variables describe physical and chemical properties of the wine.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is the quality of wine. I’d like to find out which variables influence the quality.

Of the features you investigated, were there any unusual distributions?

The only unusual distribution was the one for sugar. There was an extreme outlier and it didn’t follow any usual distribution at all.

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

Here are some of the observations I made from the correlation table and the plot:

Quality is positively correlated with alcohol (0.44) and negatively correlated with density (-0.31). A negative correlation exists between pH and fixed acidity (-0.43), which seems logical to me. There is a strong correlation between density and residual sugar (0.84). We have already seen that the quality of white wine seems to decrease with increasing density. Since density and residual sugar have a strong positive correlation, I expected the amount of sugar to have a negative impact on white wine quality too. But the correlation between sugar and quality is very weak (-0.10).

Free sulfur dioxide and total sulfur dioxide are fairly strongly correlated (0.62).
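Each entry of the correlation table is a Pearson correlation coefficient between two columns. As a reminder of what that number measures, here is a minimal Python sketch that computes it from scratch. The function name and the sample values are hypothetical and are not taken from the wine data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Hypothetical values for illustration only.
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [0.998, 0.996, 0.995, 0.992, 0.990]
print(round(pearson_r(alcohol, density), 3))
```

The coefficient only captures linear association, so a value near zero does not rule out a non-linear relationship.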

The plot looks much like what I expected from the correlation value. Alcohol shows a clear relationship with the quality of white wine.

Even if we eliminate the most extreme outliers, the plot does not show any sign that fixed acidity has an effect on the quality. The mean values are almost the same for all quality values.

Volatile acidity also seems to have no effect on the quality. The only thing we see here is a higher mean for category 4 quality wines. The other mean values do not differ much.
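The comparisons in this section rest on the mean of a variable within each quality category. A minimal Python sketch of that grouping step is shown below; the function name and the rows are hypothetical and are not taken from the wine data.

```python
from collections import defaultdict
from statistics import mean

def mean_by_group(rows, group_key, value_key):
    """Return {group: mean of value_key} for a list of dict rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    # Sort by group so the result reads from low to high quality.
    return {group: mean(values) for group, values in sorted(groups.items())}

# Hypothetical rows for illustration only.
rows = [
    {"quality": 4, "volatile.acidity": 0.40},
    {"quality": 4, "volatile.acidity": 0.36},
    {"quality": 6, "volatile.acidity": 0.25},
    {"quality": 6, "volatile.acidity": 0.27},
    {"quality": 8, "volatile.acidity": 0.28},
]
group_means = mean_by_group(rows, "quality", "volatile.acidity")
print(group_means)
```

Group means for the rarest quality categories rest on very few wines, so differences there should be read with caution.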

I eliminated some extreme outliers here as well. Citric acid also seems to have no effect on the quality. The mean value for citric acid is a little higher for very high quality wines as well as for very low quality wines, but I can’t draw any conclusions from this.

Chlorides seem to have a small effect on the quality of white wine. Higher quality wines seem to have a lower chlorides value.

Free sulfur dioxide seems to have no effect on the quality of wine either. The only thing that catches the eye is the noticeably lower mean value for quality 4 white wines.

This plot does not look very different from the previous one, which makes sense because the two variables are correlated, as we have seen before.

Looking at the regression line, one might think that sugar has an effect on the quality of white wines, but looking at the overall distribution of the values, I am no longer sure about that.

This supports what we assumed before: higher quality wines tend to have a lower density. For very high quality wines there is almost no scatter in the density values.

I assumed that density might be influenced by the amount of sugar, and this plot shows that this is the case, in line with the correlation of 0.84 in the table.

From the correlation table I thought that density and chlorides might be related to each other. If we leave aside the chloride values greater than 0.075, this might even be true.

The pH value goes up for higher quality white wines.

I wanted to know how fixed acidity, volatile acidity and citric acid influence the pH value, since pH measures acidity in some way. It looks like volatile acidity has no effect on pH (-0.03), but fixed acidity (-0.43) and citric acid (-0.16) do. When they increase, pH decreases.

The mean values for sulphates are almost the same for all quality categories. There seems to be no effect on the quality of wine.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Of all variables, alcohol shows the strongest correlation with the quality of white wine (0.44). As alcohol increases, quality tends to increase as well.

As the density increases, the quality of wine decreases.

Higher quality wines tend to have a higher pH value.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I looked at the relationship between residual sugar and density and found that when one of them increases, the other one increases as well.

I also looked at the relationships between pH and volatile acidity, fixed acidity and citric acid. Volatile acidity had no effect on pH, while fixed acidity and citric acid did. As they increase, pH decreases.

What was the strongest relationship you found?

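The strongest relationship involving quality was the one with alcohol (0.44). Across all pairs of variables, the strongest was between density and residual sugar (0.84).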

Multivariate Plots Section

This plot supports what we found earlier. White wines tend to be of better quality if they have a higher pH and more alcohol.

Since there was a correlation between pH and fixed acidity, I wanted to find out how fixed acidity affects the quality. Quality appears to go down at high fixed acidity, but I can’t identify a clear trend from this.

We see that high quality wines tend to have less sugar than low quality wines.

We have seen the correlation between density and chlorides earlier, and now we see that high chloride values go along with lower quality. This plot also shows clearly that high quality wines have a low density and very few chlorides.

This plot is really interesting. It shows that there is a connection between alcohol and chlorides: wines with more alcohol tend to have fewer chlorides. It is also interesting to look at the regression lines. For higher quality wines the slope is much steeper than for lower quality ones.

This plot is surprising. We can see that for all wines there is a negative correlation between alcohol and free sulfur dioxide except for very high quality wines. There we seem to have a positive correlation.
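The per-quality regression lines in the plots above can be summarised by their slopes. The sketch below fits a least-squares line separately for each quality group and returns the slope, so a change of sign between groups shows up directly. It is a minimal Python illustration; the function names and data points are hypothetical and are not taken from the wine data.

```python
from collections import defaultdict

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_by_group(points):
    """points: iterable of (group, x, y); returns {group: slope}."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in points:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {g: slope(xs, ys) for g, (xs, ys) in sorted(grouped.items())}

# Hypothetical (quality, alcohol, free sulfur dioxide) triples.
points = [
    (5, 9.0, 40.0), (5, 10.0, 35.0), (5, 11.0, 30.0),
    (9, 12.0, 24.0), (9, 13.0, 28.0), (9, 14.0, 32.0),
]
slopes = slope_by_group(points)
print(slopes)
```

With only a handful of wines in the top quality category, a slope fitted to that group is easily dominated by one or two points.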

Here we see that as the amount of sugar increases, the amount of alcohol goes down. And if alcohol really is the strongest indicator of quality, sugar does seem to have an indirect negative effect on quality. Maybe sugar binds alcohol?

Here we see that higher quality wines have both fewer chlorides and less sugar.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Many of the plots in this section confirmed that high quality wines tend to have more alcohol.

We have also seen that pH seems to affect the quality, since high quality wines have a higher pH. They also tend to be less dense.

Chlorides correlate positively with density and negatively with alcohol. So chlorides seem to affect the quality after all. High quality seems to go along with few chlorides.

More sugar seems to go along with higher density, which seems logical. So more sugar appears to be associated with lower quality, although the data cannot show that adding sugar would cause it. Alcohol also goes down with more sugar.

Were there any interesting or surprising interactions between features?

The most surprising interaction was between alcohol and free sulfur dioxide by quality. Only for very high quality wines were these two variables positively correlated; for all the other wines the correlation was negative.

It was also interesting to see how the correlation between alcohol and chlorides differs by quality.


Final Plots and Summary

Plot One

## 
##     1     2     3     4     5     6     7 
## 0.004 0.033 0.297 0.449 0.180 0.036 0.001
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Description One

There are very few very good wines in this dataset. Only about 3.7% of the wines are of high quality (categories 6 and 7 in the table above, which correspond to quality scores 8 and 9). The distribution of quality scores looks roughly normal. But it might be hard to tell from this what makes a very low or very high quality wine, because there are only a few data points at the extremes.
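The share of high quality wines can be checked directly from the proportion table. The sketch below assumes that the seven columns labelled 1 to 7 stand for the quality scores 3 to 9; this mapping is my reading of the table, based on the minimum of 3 and maximum of 9 in the summary. The weighted mean of the scores serves as a sanity check against the reported mean of 5.878.

```python
# Proportions from the table above, keyed by the assumed quality score.
PROPORTIONS = {
    3: 0.004,
    4: 0.033,
    5: 0.297,
    6: 0.449,
    7: 0.180,
    8: 0.036,
    9: 0.001,
}

# Share of wines in the two highest quality categories.
high_share = PROPORTIONS[8] + PROPORTIONS[9]

# Weighted mean of the scores; should be close to the reported mean.
approx_mean = sum(score * p for score, p in PROPORTIONS.items())

print(f"share of quality 8 or 9: {high_share:.1%}")
print(f"mean quality from proportions: {approx_mean:.2f}")
```

The weighted mean comes out at about 5.88, which agrees with the summary and supports reading the columns as scores 3 to 9.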

Plot Two

Description Two

Alcohol has the strongest correlation with quality (0.44). It seems to be the main feature that indicates the quality of white wine.

Plot Three

Description Three

Alcohol seems to have the greatest impact on white wine quality. Since sugar is related to density and density is related to quality, I chose this plot to show how these variables relate to the quality of wine. Wines with less sugar and more alcohol tend to be rated higher, although this analysis cannot show that changing these values would cause higher quality.


Reflection

The white wine dataset contains information on almost 4900 white wines. I started exploring the dataset by first looking at each variable individually. A combined view of a bar chart and scatterplots/boxplots proved very useful for this purpose.

In the bivariate plots section I started with a correlation table to see which variables might be worth taking into account for the further exploration. There was a clear trend between alcohol and quality. There were also some minor trends between quality on the one hand and pH and density on the other. It was interesting to see how chlorides and residual sugar correlated with density, so I assumed that they might influence the quality of wine at some point.

The multivariate plots section showed some interesting and also surprising trends. The greatest surprise was the correlation between alcohol and free sulfur dioxide, though I couldn’t draw any conclusions about whether it affects the quality somehow.

To me, alcohol, density, chlorides, residual.sugar and pH seem to be the best indicators of the quality of white wine. It didn’t look like fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and sulphates could be useful to tell the quality.

One limitation of this dataset is the small number of very low and very high quality wines. For both of these groups, my observations might not be representative at all. For future work it might be useful to find more data about wines from those two categories.

One point that caused me difficulties was the large number of extreme outliers. I had no idea whether they were correct values for extreme cases or erroneous values. The outliers distort the plots so much that I had to exclude them from some of my observations.